Textual Characteristics for Language Engineering
Abstract
Language statistics are widely used to characterize and better understand language. In parallel, the number of text mining and information retrieval methods has grown rapidly over the last decades, with many algorithms evaluated on standardized corpora, often drawn from newspapers. However, there have so far been almost no attempts to link natural language processing and language statistics in order to properly characterize those evaluation corpora and to help others pick the most appropriate algorithms for their particular corpus. We believe that no results in natural language processing should be published without a quantitative description of the corpora used. Only then can the real value of proposed methods be determined and their transferability to corpora from different genres or domains be estimated. We lay the groundwork for a language engineering process by gathering and defining a set of textual characteristics that we consider valuable for building natural language processing systems. We carry out a case study on the analysis of automotive repair orders and explicitly call upon the scientific community to provide feedback and to help establish a good practice of corpus-aware evaluation.
Related resources
Textual Characteristics of Different-sized Corpora
Recently, textual characteristics, i.e. certain language statistics, have been proposed to compare corpora originating from different genres and domains, to give guidance in language engineering processes, and to estimate the transferability of natural language processing algorithms from one corpus to another. However, it has so far been unclear how these textual characteristics behave for differen...
Textual Metadiscourse Resources in Research Articles*
This study was motivated by three factors, which also contribute to its significance for today’s academic writing. First, research articles are a significant means of communication among writers all over the world. Second, persuasion and organization are crucial notions in academic writing, where authors have to consider academic audiences and their needs. Third, some writers ar...
Patterns for Identifying and Structuring Features from Textual Descriptions: An Exploratory Study
Software Product Line Engineering (SPLE) supports developing and managing families of similar software products, termed Software Product Lines (SPLs). An essential SPLE activity is variability modeling which aims at representing the differences among the SPL’s members. This is commonly done with feature diagrams – graph structures specifying the user visible characteristics of SPL’s members and...
The Effect of Visual Representation, Textual Representation, and Glossing on Second Language Vocabulary Learning
In this study, the researcher chose three vocabulary techniques (Visual Representation, Textual Enhancement, and Glossing) and compared them with the traditional method of teaching vocabulary. Eighty advanced EFL learners were assigned to four intact groups (three experimental groups and one control group) using a proficiency test and a vocabulary test as a pre-test. In the visual group, stu...
Exploiting extra-textual and linguistic information in keyphrase extraction
Title: Exploiting extra-textual and linguistic information in keyphrase extraction
Publication date: 2012